This report discusses developments since last semester in using code-based open-source software for teaching interactive data visualisation. The insights gained from developing an exemplar of interactive techniques applied to data analysis will form the basis of the discussion. Implications for teaching, such as the choice of techniques, software and datasets, will then be discussed.
A narrative on how ideas for the exemplar emerged and changed over its development will be outlined and examined with respect to the insights gained in the process.
The motivation to develop an interactive exemplar for cluster analysis emerged from reading literature on the topic, as well as from finding “real” data in which it was of interest to see whether any such underlying structure existed. Data on the performance of New Zealand schools in the National Certificate of Educational Achievement (NCEA), at the four qualification levels, Level One, Two, Three (L1, L2, L3) and University Entrance (UE), were obtained from the New Zealand Qualifications Authority (NZQA) website. Information on the school decile, region and a “small” cohort warning was also provided in the data. The decile rating is a measure of the general income level of the families of students attending the school: the socio-economic status of the student body rises as the decile increases from one to ten. A handful of schools have a decile rating of zero, due to unique circumstances that make them exempt from the socio-economic measure. The achievement rate of a school for each qualification level was quantified in several ways. The achievement indicator chosen for this analysis was the proportion of students at the school who obtained the qualification level, given that they were entered in enough standards to have the opportunity to earn the qualification in the 2016 school year. This is referred to as the “Current Year Achievement Rate” for the “Participating Cohort” by the NZQA (see http://www.nzqa.govt.nz/assets/Studying-in-NZ/Secondary-school-and-NCEA/stats-reports/NZQA-Secondary-Statistics-Consolidated-Data-Files-Short-Guide.pdf).
Generally, students attempt L1 at Year 11, L2 at Year 12 and the final two qualifications at Year 13, hence only the achievement rates of students who participated in each qualification at these intended year levels were examined. Furthermore, only schools with achievement indicators across all four qualification levels were retained, thus reducing the dataset to 408 schools from around New Zealand. The focus of analysis will be on its subset of 91 Auckland schools, but the New Zealand dataset of 408 schools will be used to demonstrate how interactive techniques can be useful as the number of observations increases.
The Auckland subset was chosen because it is less affected by the unreliability of small sample sizes. The NZQA indicator of a “small” cohort was set at a very low threshold of fewer than five candidates for any qualification level. As the most populous city in New Zealand, Auckland has many of the larger schools, but there may still be some schools left in the analysis that have fewer than 30 candidates entered in a qualification level. The first few observations from the data set of 91 schools in the Auckland region are shown below.
##                              L1    L2    L3    UE Decile
## Al-Madinah School         0.889 1.000 1.000 0.640      2
## Albany Senior High School 0.905 0.904 0.882 0.701     10
## Alfriston College         0.659 0.651 0.563 0.369      2
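The filtering described above can be sketched in R as follows. This is a minimal illustration on a hypothetical four-row data frame: the column names `School` and `Region` and all values are assumptions, not the layout of the actual NZQA file.

```r
# Hypothetical toy data; the real NZQA file has its own column names and many more rows.
schools <- data.frame(
  School = c("School A", "School B", "School C", "School D"),
  Region = c("Auckland", "Auckland", "Otago", "Auckland"),
  L1 = c(0.889, 0.905, NA, 0.659),
  L2 = c(1.000, 0.904, 0.800, 0.651),
  L3 = c(1.000, 0.882, 0.750, NA),
  UE = c(0.640, 0.701, 0.500, 0.369),
  stringsAsFactors = FALSE
)
levels4 <- c("L1", "L2", "L3", "UE")

# Keep only schools with an achievement rate at all four qualification levels.
nz <- schools[complete.cases(schools[, levels4]), ]

# Subset the retained schools to the Auckland region.
akl <- nz[nz$Region == "Auckland", ]
akl
```

Using `complete.cases()` on just the four rate columns means a missing value elsewhere (for example in decile) would not drop a school; whether that is desirable depends on the analysis.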
A question that naturally arises from the NCEA data is whether a school’s performance is related to its decile rating.
The pairs plot in Figure 1 shows that the achievement rates at L1, L2 and L3 have a weak relationship with decile rating, but that the positive correlation is strong at UE. There appears to be an increasing “lower bound” to achievement rates for L1, L2 and L3 as decile increases, but there is a lot of scatter above this boundary. In the bivariate scatterplots we can also see the spread of achievement rates varying across decile groups: the variation in achievement rates decreases as the decile increases (from one), across the L1, L2 and L3 qualification levels. Many schools approach the maximum 100% achievement rate for L1, L2 and L3, hence it is not surprising that their univariate distributions are skewed to the left in Figure 2. The distribution of achievement rates for UE is less skewed and hints at two possible groupings. Furthermore, performance across the qualification levels appears to be positively correlated, especially between L1 and L2. Hence the low-dimensional plots indicate non-normality, unequal spread between groups and multicollinearity.
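A pairs plot of this kind, together with the correlation matrix that backs up the visual impression, might be produced along the following lines. The data here are simulated purely for illustration (the generating model and the object name `sim` are assumptions) and are not the NZQA figures.

```r
# Simulated stand-in for the Auckland data: NOT the real NZQA figures.
set.seed(1)
n <- 91
decile <- sample(1:10, n, replace = TRUE)
clamp <- function(x) pmin(pmax(x, 0), 1)  # keep rates inside [0, 1]
ability <- 0.6 + 0.03 * decile + rnorm(n, sd = 0.08)
sim <- data.frame(
  L1 = clamp(ability + rnorm(n, sd = 0.05)),
  L2 = clamp(ability + 0.03 + rnorm(n, sd = 0.05)),
  L3 = clamp(ability + rnorm(n, sd = 0.08)),
  UE = clamp(ability - 0.25 + rnorm(n, sd = 0.10)),
  Decile = decile
)

# Scatterplot matrix of the four achievement rates and decile.
pairs(sim, pch = 19, cex = 0.5)

# Pairwise correlations as a numerical companion to the plot.
cor_mat <- round(cor(sim), 2)
cor_mat
```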
The pairs plot in Figure 2 provided a glimpse into the multivariate distribution of achievement rates across the four qualification levels. The parallel coordinates plot (PCP) shown below in Figure 3 allows us to further compare the multivariate distributions of achievement rates for different decile groups, as well as identify high dimensional clustering and outliers.
The ordering of axes in a PCP greatly affects the quality of the graphical analysis, hence interactivity that enables reordering of axes is recommended (Unwin, 2015). In the case of the NCEA data, the natural ordering of the four qualification levels by difficulty coincides with the recommendation from Cook and Swayne (2007) to order the axes based on correlation. In addition, Unwin (2015) highlights that the layering of colours also needs to be considered carefully, since the last group assigned a colour will dominate the other lines.
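One simple way to order axes by correlation is sketched below. The greedy rule used here (start with the most correlated pair, then keep appending the variable most correlated with the last axis) is an assumption chosen for illustration, not necessarily the procedure Cook and Swayne (2007) have in mind. The helper `order_axes()` and the simulated scores are hypothetical, and the static plot uses `parcoord()` from the MASS package if it is installed.

```r
# Greedy axis ordering: begin with the most strongly correlated pair, then
# repeatedly append the unused variable most correlated with the last axis.
order_axes <- function(x) {
  r <- abs(cor(x))
  diag(r) <- 0
  ord <- as.integer(which(r == max(r), arr.ind = TRUE)[1, ])
  while (length(ord) < ncol(x)) {
    last <- ord[length(ord)]
    cand <- setdiff(seq_len(ncol(x)), ord)
    ord <- c(ord, cand[which.max(r[last, cand])])
  }
  colnames(x)[ord]
}

# Simulated scores with a shared component (illustration only).
set.seed(2)
n <- 91
shared <- rnorm(n)
toy <- data.frame(
  UE = shared + rnorm(n, sd = 1.0),
  L2 = shared + rnorm(n, sd = 0.3),
  L3 = shared + rnorm(n, sd = 0.6),
  L1 = shared + rnorm(n, sd = 0.3)
)

ord <- order_axes(toy)
ord

# Draw the PCP with the reordered axes (MASS ships with most R installations).
if (requireNamespace("MASS", quietly = TRUE)) {
  MASS::parcoord(toy[, ord], col = "grey40", var.label = TRUE)
}
```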
The positive relationships previously identified in the pairs plot should translate to approximately horizontal lines between the parallel axes in the PCP, as opposed to sloped lines or “criss-cross” patterns for negative correlation. The static plot in Figure 3 calls into question whether the positive relationships hold true for schools with low achievement rates and in some decile groups. In particular, there appears to be a negative relationship between achievement rates at L3 and UE for many schools. The achievement rates for UE are clearly the most variable.
The higher decile schools in Auckland appear to dominate the high achievement rates across all qualification levels, while lower decile schools are less consistent with each other in terms of their performance across the levels. Although there are only 91 observations (lines), it is quite difficult to identify even “ball park boundaries”, on the 11-point decile scale, to distinguish between “higher” and “lower” decile schools when describing possible patterns.
The following two plots demonstrate how alpha blending can help minimise the effects of overplotting as the number of observations increases. It is easier to check whether the patterns identified in Auckland schools extend to the 408 schools across New Zealand using Figure 5, where alpha blending is applied, than using Figure 4. The performance of high achieving lower decile schools is less “drowned out” by the dominance of their higher decile counterparts when alpha blending is used. Figure 5 reveals a group of schools that converge at 100% achievement for L3 but have different levels of success at L2 and UE. It would be of interest to compare the achievement rates of these schools across the qualification levels. Similarly, we would be interested in tracking the performance of the school with the lowest achievement rate at L1. The plots suggest the school makes a convincing recovery in performance at L2, but it is impossible to follow the school’s progress beyond L2, due to overplotting. Interactive techniques will later be used to explore these points of interest.
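The effect of alpha blending can be sketched in base R by drawing the same parallel coordinates twice, once with opaque and once with semi-transparent colours. The data are simulated (408 rows only to mirror the size of the New Zealand dataset), and the palette and the alpha value of 0.25 are arbitrary assumptions.

```r
# Simulated rates for 408 hypothetical schools: NOT the real NZQA figures.
set.seed(3)
n <- 408
decile <- sample(1:10, n, replace = TRUE)
rates <- sapply(c(L1 = 0, L2 = 0.02, L3 = 0, UE = -0.25), function(shift) {
  pmin(pmax(0.6 + 0.03 * decile + shift + rnorm(n, sd = 0.12), 0), 1)
})

# One colour per decile, then a semi-transparent copy of the same colours.
palette10 <- colorRampPalette(c("steelblue", "orange"))(10)
opaque <- palette10[decile]
faded <- adjustcolor(opaque, alpha.f = 0.25)

# Each column of t(rates) is one school, drawn as a line across the four axes.
draw_pcp <- function(cols, title) {
  matplot(t(rates), type = "l", lty = 1, col = cols, xaxt = "n",
          xlab = "", ylab = "Achievement rate", main = title)
  axis(1, at = 1:4, labels = colnames(rates))
}

op <- par(mfrow = c(1, 2))
draw_pcp(opaque, "No alpha blending")
draw_pcp(faded, "Alpha blending")
par(op)
```

Because lines are drawn in row order, the last rows plotted still sit on top; sorting the rows by group is one way to control which group dominates.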
Colour and interactive techniques are recommended for maximising the effectiveness of a PCP. Venables and Ripley (2002) argue parallel coordinate plots are “often too ‘busy’ without means of interaction” (p. 315). Figure 6 demonstrates how adding interactive filtering helps to further reduce the problems with overplotting and allows direct comparison between selected decile groups.
One of the strengths of a PCP is in identifying multivariate features, such as outliers (Cook & Swayne, 2007). Figure 3 suggests a school’s performance can be unusual in two ways: either it performs inconsistently with the bivariate correlations between the qualification levels (this was previously noted as more typical in “lower” decile schools), or its achievement rates across the levels are unusual compared to the rest of its decile group. The latter becomes difficult to see for the New Zealand dataset in the static plots (Figures 4 and 5), even with alpha blending applied. The interactive filtering feature in Figure 6 overcomes the problems caused by overplotting by allowing the user to isolate the distribution of each decile group (via a double-click on the legend). Furthermore, the hovering tooltip enables instant identification of outliers by school name, and the linked brushing of points on the parallel axes of the PCP allows the achievement rates of individual schools across all qualification levels to be identified despite the overplotting.
Hence linked brushing can be used to explore the outlier at L1, previously identified in Figure 5. The tooltip identifies the school and the linked brushing confirms that despite the poor performance at L1, the school had 100% achievement rates at L2 and L3.
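An interactive PCP with tooltips, legend filtering and click-based brushing might be assembled roughly as follows. This is a sketch under several assumptions: it is not necessarily how Figure 6 was built, it relies on the ggplot2 and plotly packages being installed, and the six schools and their rates are invented for illustration.

```r
# Hypothetical wide-format data: six invented schools.
wide <- data.frame(
  School = paste("School", LETTERS[1:6]),
  Decile = c(1, 1, 5, 5, 10, 10),
  L1 = c(0.14, 0.70, 0.80, 0.85, 0.95, 0.97),
  L2 = c(1.00, 0.75, 0.82, 0.88, 0.96, 0.98),
  L3 = c(1.00, 0.60, 0.78, 0.80, 0.94, 0.97),
  UE = c(0.30, 0.25, 0.50, 0.55, 0.85, 0.90),
  stringsAsFactors = FALSE
)
levels4 <- c("L1", "L2", "L3", "UE")

# Reshape to long format: one row per school and qualification level.
long <- data.frame(
  School = rep(wide$School, times = length(levels4)),
  Decile = factor(rep(wide$Decile, times = length(levels4))),
  Level = factor(rep(levels4, each = nrow(wide)), levels = levels4),
  Rate = unlist(wide[levels4], use.names = FALSE)
)

if (requireNamespace("plotly", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Key the data by school so that clicking one point highlights the whole line.
  key <- plotly::highlight_key(long, ~School)
  p <- ggplot(key, aes(x = Level, y = Rate, group = School,
                       colour = Decile, text = School)) +
    geom_line() +
    geom_point()
  # tooltip = "text" shows the school name on hover.
  widget <- plotly::highlight(
    plotly::ggplotly(p, tooltip = "text"),
    on = "plotly_click", off = "plotly_doubleclick"
  )
  # Print `widget` in an interactive session to view the plot.
}
```

In the resulting widget, double-clicking a legend entry isolates that decile group, which is plotly's default legend behaviour.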
Cook and Swayne (2007) describe how linked brushing enables dynamic database querying and direct comparisons between the subsetted data of interest and the remaining observations. Figure 7 illustrates how brushing the point where L3=100% on the interactive PCP highlights the extent of the variability in performance at the other qualification levels, for schools that have a 100% achievement rate at L3. Surprisingly, the UE achievement rates for these schools are just as variable as they are overall for all New Zealand schools in the data set. Many of these schools also obtained 100% achievement rates at L2, but again there is surprisingly variable success at L1.
The query made via linked brushing in Figure 7 is similar to subsetting the data as shown below. A summary could be used to examine the performance of the 42 schools across the other qualifications, but making sense of this summary would require comparison with statistics for the whole data set. Furthermore, the summary hides the unusual performance of individual schools at certain qualification levels. The interactive techniques allow us to gain insights about the NCEA data that were not possible from using static plots or were
L3all_achieve <- nzqa[nzqa$L3==1, c("L1", "L2", "UE")]
nrow(L3all_achieve)
## [1] 42
summary(L3all_achieve)
##        L1               L2               UE
##  Min.   :0.1430   Min.   :0.5710   Min.   :0.1540
##  1st Qu.:0.8407   1st Qu.:0.9073   1st Qu.:0.5000
##  Median :0.9260   Median :1.0000   Median :0.7700
##  Mean   :0.8848   Mean   :0.9463   Mean   :0.7025
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.9380
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
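To make sense of such a subset summary without a plot, the same statistics would need to be computed for the whole data set and placed alongside. A minimal sketch of that comparison is given below on invented data; the data frame `toy`, the helper `stats()` and the choice of median and interquartile range are assumptions for illustration.

```r
# Invented data: every fifth pattern repeats so that exactly 20 of 50 rows have L3 == 1.
set.seed(4)
n <- 50
toy <- data.frame(
  L1 = round(runif(n, 0.5, 1), 3),
  L2 = round(runif(n, 0.6, 1), 3),
  L3 = rep(c(1, 0.9, 0.8, 1, 0.7), times = n / 5),
  UE = round(runif(n, 0.2, 1), 3)
)
others <- c("L1", "L2", "UE")
is_full <- toy$L3 == 1

# Median and interquartile range of each remaining qualification level.
stats <- function(d) {
  sapply(d[others], function(v) c(median = median(v), IQR = IQR(v)))
}

# Subset statistics next to the same statistics for all schools.
comparison <- list(L3_all_achieve = stats(toy[is_full, ]),
                   all_schools = stats(toy))
comparison
```

Even with both tables in hand, individual schools with unusual profiles remain hidden, which is the gap that brushing on the plot fills.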
Since queries are made through the plot, the effectiveness of linked brushing is itself affected by overplotting. In plotly, points on the PCP can be brushed with a click, but brushing with a “drag-and-select” mouse motion is not available at the same time. ggobi offers greater flexibility, since lines as well as points can be brushed.
The PCP in Figure 3 suggests that patterns of performance across decile groups, if they exist, are difficult to distinguish at the multivariate level for Auckland schools. Principal component analysis (PCA) provides a way to reveal interesting multivariate structure, through finding projections of the data that show maximal variability (Venables & Ripley, 2002).
A plot of the first two principal components of the Auckland schools dataset is shown in Figure 7. The decile of the schools is represented by the colour and plotting symbol. The axes for the original variables, reflecting the loadings of the principal components, indicate that the first principal component considers the schools’ performance across all four qualification levels, while the second principal component contrasts performance in UE against the remaining qualifications. Not surprisingly, the plot shows more spread across the first principal component, since it explains a much greater proportion of the variation in the data. The first principal component reveals a division between the majority of schools and a smaller group that is positioned away from the variable axes shown. The schools in the smaller group are decile five and below, except for one decile nine school. There is also a decile ten school that appears to be unusual when examining both principal components. The use of colour highlights the remaining decile nine and ten schools as similar in performance, as weighted by the principal components. On the other hand, schools from the other deciles seem to be more spread out from each other.
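A principal components plot of this kind can be sketched with `prcomp()` from base R. The data below are simulated, and the choice to centre but not rescale the rates is an assumption, since the scaling used for the figure is not stated.

```r
# Simulated stand-in for the Auckland data: NOT the real NZQA figures.
set.seed(5)
n <- 91
decile <- sample(1:10, n, replace = TRUE)
clamp <- function(x) pmin(pmax(x, 0), 1)
ability <- 0.6 + 0.03 * decile + rnorm(n, sd = 0.08)
sim <- data.frame(
  L1 = clamp(ability + rnorm(n, sd = 0.05)),
  L2 = clamp(ability + 0.03 + rnorm(n, sd = 0.05)),
  L3 = clamp(ability + rnorm(n, sd = 0.08)),
  UE = clamp(ability - 0.25 + rnorm(n, sd = 0.10)),
  Decile = decile
)
levels4 <- c("L1", "L2", "L3", "UE")

# PCA on the four achievement rates (centred, not rescaled: an assumption).
pca <- prcomp(sim[, levels4], center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component, and the loadings.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)
round(pca$rotation, 2)

# Scores on the first two components, with decile shown by colour and symbol.
palette10 <- colorRampPalette(c("steelblue", "orange"))(10)
plot(pca$x[, 1:2], col = palette10[sim$Decile], pch = sim$Decile,
     main = "First two principal components")

# Biplot adds arrows for the original variables (the loadings).
biplot(pca, cex = 0.6)
```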
The principal components plot in Figure 7 provided a view of the multivariate distribution of the Auckland schools dataset that was less easily affected by overplotting than a static PCP. Hence refinements on observations were possible. From the parallel coordinate plots we observed that “higher” decile schools performed more consistently with each other and dominated across the four qualification levels, but it was difficult to quantify the decile “cut off” for such schools. The PCA suggests these are the decile nine and ten Auckland schools, with the exception of two schools. One cannot help but wonder: Who are the two unusual decile nine and ten Auckland schools? The interactive tooltips in Figure 8 instantly satisfy this curiosity and hence demonstrate the usefulness of being able to directly identify unusual observations. The linked brushing between visual representations of the PCA model and the original variables also enables confirmation of previous observations and a better understanding of the PCA model.
* closer to variable axes means what? Anything? Better performance at that level?
* link to decile plot as well as pcp or pairs plot of performance across the 4 qualifications.
Data on school size by year group and ethnicity for 2016 were obtained from the government website, Education Counts (2017).
The minimum cohort size for Year 11, 12 and 13 can be used as an indicator of whether the achievement rates are based on small sample sizes. We can see the minimum cohort sizes for
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00   38.75   82.50  112.10  143.20 1150.00
##     Year.11           Year.12          Year.13         Min.Cohort
##  Min.   :   0.00   Min.   :   0.0   Min.   :   1.0   Min.   :   0.00
##  1st Qu.:  59.75   1st Qu.:  51.5   1st Qu.:  40.0   1st Qu.:  38.75
##  Median : 117.50   Median : 106.0   Median :  83.0   Median :  82.50
##  Mean   : 150.79   Mean   : 135.1   Mean   : 119.5   Mean   : 112.06
##  3rd Qu.: 200.50   3rd Qu.: 182.8   3rd Qu.: 151.5   3rd Qu.: 143.25
##  Max.   :1322.00   Max.   :1150.0   Max.   :1843.0   Max.   :1150.00
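A minimum cohort size per school, as summarised above, can be computed as the row-wise minimum of the three year-level rolls. The sketch below uses three invented schools; only the column names are taken from the output above.

```r
# Invented roll counts for three hypothetical schools.
roll <- data.frame(
  Year.11 = c(120, 0, 45),
  Year.12 = c(110, 12, 50),
  Year.13 = c(90, 8, 30)
)

# pmin() takes the element-wise (per school) minimum across the three columns.
roll$Min.Cohort <- pmin(roll$Year.11, roll$Year.12, roll$Year.13)
roll

summary(roll$Min.Cohort)
```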
As expected, we see less spread in the PCA plot as the minimum cohort size is increased, and the first principal component is able to explain more of the variation in the data. Once the minimum cohort size is above 20, the amount of variability continues to decrease but at a slower rate. The biggest drop in unexplained variability occurs when the minimum cohort size is increased from 10 to 20.
The positions of the axes remain reasonably consistent, with the first principal component being a measure of overall performance and the second component contrasting L1 and L2 performance against L3 and UE.
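The effect of the cohort threshold on the PCA can be examined by refitting the model on progressively filtered data and recording the share of variance explained by the first component. The sketch below does this on simulated data in which smaller cohorts give noisier rates; the helper `pc1_share()`, the generating model, and the thresholds 0 and 50 are assumptions (10, 20 and 100 are the values mentioned in the text).

```r
# Simulated schools: rates are binomial proportions, so small cohorts are noisier.
set.seed(6)
n <- 300
cohort <- sample(1:300, n)              # a permutation of 1..300
latent <- rnorm(n, mean = 1, sd = 0.6)  # school-level performance on the logit scale
rate <- function(shift) {
  rbinom(n, size = cohort, prob = plogis(latent + shift)) / cohort
}
toy <- data.frame(L1 = rate(0), L2 = rate(0.3), L3 = rate(0), UE = rate(-1),
                  Min.Cohort = cohort)
levels4 <- c("L1", "L2", "L3", "UE")

# Share of total variance explained by the first principal component,
# using only schools whose minimum cohort meets the threshold.
pc1_share <- function(data, vars, min_cohort) {
  keep <- data$Min.Cohort >= min_cohort
  pca <- prcomp(data[keep, vars])
  pca$sdev[1]^2 / sum(pca$sdev^2)
}

thresholds <- c(0, 10, 20, 50, 100)
shares <- sapply(thresholds, function(m) pc1_share(toy, levels4, m))
names(shares) <- thresholds
round(shares, 3)
```

In an interactive version, the threshold would be driven by a slider rather than a fixed vector, with the PCA plot redrawn for each value.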
The decile patterns previously noted for Auckland schools, where decile nine and ten schools are more consistent with each other in performance than other deciles, appear to hold for New Zealand schools in general as we vary the minimum cohort size considered.
When the minimum cohort size is set at 100, the PCA plot appears quite similar to the plot for Auckland schools but with fewer schools from decile one. We can verify this quickly with an interactive drop-down menu that allows us to choose the variable “Region” and brush the group of Auckland schools.